This is an extract of car claims handled by AXA Seguros & Direct Seguros.
We focus on 2014 claims.
Each claim is identified with claim_id.
Some claims are described on several rows and have a multitask_id.
The other IDs are explicit:
- policy_id & customer_id differ by the underlying databases: customer relates to the client, policy relates to the contract. Note that 20% of both policies & customers have more than one claim.
- mediator_id & garage_id are the pivots to perform BI on the bodyshop & the mediator respectively. I suggest we focus on the bodyshop.

The CCA aims at scoring bodyshops, to offer agreements to the best ones and to send each claim to the right bodyshop. For that we need to build features per bodyshop for the scoring: use garage_id.
First the target variables, related to cost.
The most important one, cost_total, is the cost of the repair, before tax & discount.
This cost is split into several components:
- labor cost for repair/replacement: cost_rep_lab
- labor cost for painting: cost_pt_lab
- cost of painting material: cost_pt_mat
- cost of spare parts: cost_rep_mat
These costs are completed by:
- number of spare parts used in the repair: cost_number_sparep
- amount of the discount applied to cost_total: cost_discount
- amount of the excess (participation of the insured in the cost of the claim): cost_excess
- hourly cost of labor for repair/replacement: hcost_rep_lab
- hours of painting: time_lab_pt
- hours of labor for repair/replacement: time_lab_rep
In addition we created some dummy variables flagging specific spare parts needed for the repair:
- headlight: check_repair_headlight
- door: check_repair_door
- front bumper: check_repair_frontbumper
- back bumper: check_repair_backbumper
- bonnet: check_repair_bonnet
- mudguard: check_repair_mudguard

Note that most spare parts are specific to a car brand, model & year, and in the database the references and names are often wrong. Here we defined the features out of very generic spare parts.
Severity of the claim is perhaps the most important feature to anticipate the cost of a claim.
Unfortunately, nothing is filled in the database prior to the expert assessment.
We know the information exists: think for instance of the location of damages, the number of vehicles involved, the circumstances of the claim, the speed at the moment of the claim…
Here is what we could find:
- guilt of the insured: claim_guilt
- involves a bodily injury?: claim_injuries
- type of settlement (standard or convention between insurers): claim_solving
- guarantee affected: coverage_affected_a
- type of claim: claim_type
- channel for claim opening: expert_order_opening_channel
- was the vehicle replaced during the repair?: vehicle_replace
- did the bodyshop proceed to a pick up and delivery?: vehicle_pnd

Other features can help anticipate the cost of the claim:
- sale price of the vehicle: vehicle_sale_price. A more insightful quantity is the value of the vehicle prior to the claim (adjusted for age, mileage, repairs, former claims, etc.), but we don't have it.
- vehicle brand: vehicle_brand
- vehicle age: vehicle_age

One could also use dates to compute: time from underwriting to claim, claim to assessment, and assessment to final payment of the repair or retrieval of the car.
Sometimes we want to either:
1) explain the gap between two types of claims,
2) or measure this gap.
Think for instance of the agreed bodyshops (garage_tac) or the insurance entity (company). Solutions are:
1) build a predictive model strong enough to achieve a good AUC, under the constraint of good interpretability;
2) build a model as interpretable as possible, to read effects "ceteris paribus". If you use a GLM, use PLS to improve the consistency of the coefficients, and read the gap on the coefficient "ceteris paribus".
In addition we also provide the geolocation of both the bodyshop that repaired the claim & the customer: customer_lon, customer_lat, garage_lon, garage_lat.
With these four features we can build the Euclidean distance from the bodyshop to the customer: distance_customer_garage.
As we know which bodyshop was recommended by AXA for the repair, we also compute the distance from the customer to this bodyshop: distance_customer_offered_garage.
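The Euclidean distance on raw longitude/latitude is only a rough proxy, since one degree of longitude shrinks with latitude. If distances in kilometres are preferred, the haversine formula is a small step up; a minimal sketch (the function name is ours, the column names follow the dataset):

```r
# Haversine great-circle distance in kilometres (sketch).
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371) {
  to_rad <- pi / 180
  dlon <- (lon2 - lon1) * to_rad
  dlat <- (lat2 - lat1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(pmin(1, sqrt(a)))   # pmin guards against rounding above 1
}

# e.g. a distance in km instead of raw degrees:
# claims$distance_customer_garage_km <- with(claims,
#   haversine_km(customer_lon, customer_lat, garage_lon, garage_lat))
```

At the scale of Spain the ranking of bodyshops by distance is usually the same either way, so the Euclidean version remains a reasonable feature.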
As cost also depends on the local situation, we provide garage_postal_code, the postal code of the bodyshop, to easily compute local indicators of cost.
That said, we strongly recommend using k-NN based methods to build "local" features related to cost.
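As a sketch of what such a k-NN "local" feature could look like: for each bodyshop, average a cost variable over its k nearest bodyshops in coordinate space. The function below is a base-R illustration with illustrative column roles (a dedicated k-NN package such as FNN scales better than a full distance matrix):

```r
# Local mean of `value` over the k nearest points in (lon, lat) space,
# excluding the point itself (base-R sketch, O(n^2) distance matrix).
knn_local_mean <- function(lon, lat, value, k = 5) {
  d <- as.matrix(dist(cbind(lon, lat)))
  diag(d) <- Inf                          # never count the point itself
  sapply(seq_along(value), function(i) {
    nn <- order(d[i, ])[seq_len(k)]       # indices of the k nearest points
    mean(value[nn], na.rm = TRUE)
  })
}
```

For the claims data one would call it once per bodyshop, e.g. knn_local_mean(garage_lon, garage_lat, cost_total, k = 20), to get a local cost indicator comparable to the postal-code one.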
load("claims_data_training.RData")
Unsupervised techniques
The geometry/topology of a dataset is a rich subject. In practice, the first question one faces is: how do I compare factors, characters, booleans and numerics?
The tools for putting it all together are "metrics" or "distances".

## Distance between words (see text mining session)
- based on the spelling
- based on the semantics
- based on the syntax

## Distance between numerical vectors
Example with 2 variables: what is the distance between a 50-year-old who earns 100k and a 30-year-old who earns 30k? Should we:

### Do nothing?
- scatter plot
load("full_expert_order_geolocation.RData")
illustration=full_expert_order_all[,c("vehicle_sale_price","cost_number_sparep"),with=F]
illustration=sample(na.omit(illustration))[1:1000]
illustration$tag=c("A","B","C",rep(NA,997))
ggplot(illustration,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)
## Warning: Removed 997 rows containing missing values (geom_text).
dist(illustration[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
## 1 2
## 2 1585.431
## 3 3682.962 2097.564
This is not intuitive, is it?
- density plot
ggplot(illustration,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black") +
guides(alpha=FALSE)
### scale & center?
- scatter plot
illustration2=preProcess(illustration,method=c("center","scale"))
illustration2=predict(illustration2,illustration)
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 55 rows containing missing values (geom_point).
## Warning: Removed 997 rows containing missing values (geom_text).
dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
## 1 2
## 2 1.1006297
## 3 0.8247473 1.8263455
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black") +
guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 55 rows containing non-finite values (stat_density2d).
## Warning: Removed 55 rows containing non-finite values (stat_density2d).
The scale has changed, so the metric changes, but it looks the same.

### cap or remove extreme values
- scatter plot
illustration2=illustration2[vehicle_sale_price%between%quantile(illustration2$vehicle_sale_price,c(0.025,0.975))]
illustration2=illustration2[cost_number_sparep%between%quantile(illustration2$cost_number_sparep,c(0.025,0.975))]
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 20 rows containing missing values (geom_point).
## Warning: Removed 929 rows containing missing values (geom_text).
dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
## 1 2
## 2 1.1006297
## 3 0.8247473 1.8263455
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black") +
guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 20 rows containing non-finite values (stat_density2d).
## Warning: Removed 20 rows containing non-finite values (stat_density2d).
### correct asymmetry (skewness) and tails of the distribution (kurtosis)
- scatter plot
temp=illustration
temp$vehicle_sale_price=log(temp$vehicle_sale_price)
illustration2=preProcess(temp,method=c("center","scale"))
illustration2=predict(illustration2,temp)
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 74 rows containing missing values (geom_point).
## Warning: Removed 997 rows containing missing values (geom_text).
dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
## 1 2
## 2 1.1008248
## 3 0.8173244 1.8243297
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black") +
guides(alpha=FALSE)+xlim(-1,2)+ylim(-2,3)
## Warning: Removed 74 rows containing non-finite values (stat_density2d).
## Warning: Removed 74 rows containing non-finite values (stat_density2d).
### apply ranking, i.e. margin transformation, to make the data uniform
- scatter plot
illustration2<-illustration
illustration2$vehicle_sale_price=round(rank(illustration2$vehicle_sale_price))
illustration2$vehicle_sale_price=illustration2$vehicle_sale_price/max(illustration2$vehicle_sale_price)
illustration2$cost_number_sparep=round(rank(illustration2$cost_number_sparep))
illustration2$cost_number_sparep=illustration2$cost_number_sparep/max(illustration2$cost_number_sparep)
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price,color=is.na(tag)))+geom_point()+geom_text(aes(label=tag),color="red",size=5)
## Warning: Removed 997 rows containing missing values (geom_text).
dist(illustration2[!is.na(tag)][,c("vehicle_sale_price","cost_number_sparep"),with=F])
## 1 2
## 2 0.5042787
## 3 0.1787316 0.5538086
ggplot(illustration2,aes(x=cost_number_sparep,y=vehicle_sale_price))+stat_density2d(aes(alpha=..level.., fill=..level..), size=2, bins=10, geom="polygon") +
scale_fill_gradient(low = "yellow", high = "red") +
scale_alpha(range = c(0.00, 0.5), guide = FALSE) +
geom_density2d(colour="black") +
guides(alpha=FALSE)
The Box-Cox power transformation provides a general framework for automatically selecting a convenient transformation to reshape the data.
It can be applied using the preProcess() function with method "BoxCox" from the caret package.
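Under the hood, Box-Cox picks the power lambda maximizing the profile log-likelihood of the transformed data; a minimal base-R sketch of that selection (grid search; caret's preProcess() does this for you):

```r
# Box-Cox transform: y(lambda) = (y^lambda - 1) / lambda, log(y) at lambda = 0.
# lambda is chosen by maximizing the normal profile log-likelihood.
boxcox_lambda <- function(y, grid = seq(-2, 2, by = 0.05)) {
  stopifnot(all(y > 0))                   # Box-Cox requires positive data
  loglik <- sapply(grid, function(l) {
    z <- if (abs(l) < 1e-8) log(y) else (y^l - 1) / l
    n <- length(y)
    # -n/2 * log(sigma_hat^2) + (lambda - 1) * sum(log y)
    -n / 2 * log(mean((z - mean(z))^2)) + (l - 1) * sum(log(y))
  })
  grid[which.max(loglik)]
}
```

On log-normal data the selected lambda is close to 0, i.e. the log transform used for vehicle_sale_price above.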
What are the distances between car brands or claim types?
Are Audi, Toyota & Peugeot equidistant ?
Contrary to numerical variables, factors come with no intuitive distance.

### Gower metric using the daisy function
illustration=full_expert_order_all[,c("coverage_affected_a","vehicle_brand","cost_number_sparep","garage_tac"),with=F]
illustration=sample(na.omit(illustration))[1:5]
library(cluster)
mix_metric=daisy(illustration,metric="gower")
illustration
## vehicle_brand garage_tac coverage_affected_a cost_number_sparep
## 1: PEUGEOT 0 own damage with excess 8
## 2: NISSAN 0 own damage with excess 2
## 3: VOLKSWAGEN 0 TPL 12
## 4: PEUGEOT 0 TPL 3
## 5: OPEL 0 TPL 22
mix_metric
## Dissimilarities :
## 1 2 3 4
## 2 0.3250
## 3 0.5500 0.6250
## 4 0.3125 0.5125 0.3625
## 5 0.6750 0.7500 0.3750 0.4875
##
## Metric : mixed ; Types = N, I, N, I
## Number of objects : 5
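The Gower figures above are easy to check by hand: each numeric variable contributes |x_i - x_j| divided by its range, each nominal variable contributes 0/1 for match/mismatch, and the contributions are averaged. A base-R sketch on a toy mixed table (the helper is ours, not part of cluster::daisy):

```r
# Gower dissimilarity between rows i and j of a mixed data.frame (sketch).
gower_pair <- function(df, i, j) {
  contrib <- sapply(names(df), function(v) {
    x <- df[[v]]
    if (is.numeric(x)) {
      abs(x[i] - x[j]) / diff(range(x))   # range-normalized numeric part
    } else {
      as.numeric(x[i] != x[j])            # 0/1 mismatch for nominal part
    }
  })
  mean(contrib)
}

toy <- data.frame(brand = c("PEUGEOT", "NISSAN", "OPEL"),
                  sparep = c(8, 2, 22))
gower_pair(toy, 1, 2)   # (1 + 6/20) / 2 = 0.65
```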
Load the data for the toy examples (same data as in the supervised session)
load("toy_datasets.RData")
linear=rbind(linear_test,linear_train)
saturn=rbind(saturn_test,saturn_train)
moon=rbind(moon_test,moon_train)
k-means (look at the last example to open onto density-based algorithms) http://www.onmyphd.com/?p=k-means.clustering&ckattempt=1
k-medoids (more sophisticated than k-means but much slower) https://www.youtube.com/watch?v=osjd5eV4ypA
Not many parameters.
When used for classification, k is tuned to minimize the error rate; train & test splits then matter, since calibration is done against a target.
Otherwise, the parameter is tuned to fit intuition or visual compliance (see geo-clustering)
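One common visual tuning is the "elbow": plot the total within-cluster sum of squares against k and look for the bend. A base-R sketch on simulated data (the function name is ours):

```r
# Elbow curve: total within-cluster SS for k = 1..k_max with base-R kmeans.
elbow_wss <- function(xy, k_max = 8, nstart = 10) {
  sapply(1:k_max, function(k)
    kmeans(xy, centers = k, nstart = nstart)$tot.withinss)
}

set.seed(42)
xy <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),   # blob 1
            matrix(rnorm(200, mean = 5), ncol = 2))   # blob 2
wss <- elbow_wss(xy)
# plot(1:8, wss, type = "b")  # the bend at k = 2 suggests two clusters
```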
linear$kmeans=kmeans(linear[,c("X1","X2"),with=F],centers = 2)$cluster
ggplot(data=linear,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point()
k=5
moon$kmeans=kmeans(moon[,c("X1","X2"),with=F],centers = k)$cluster
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=kmeans]
## kmeans group1 group2 volume
## 1: 3 0.02719033 0.97280967 662
## 2: 2 0.95731707 0.04268293 492
## 3: 5 0.06073446 0.93926554 708
## 4: 1 0.96288660 0.03711340 485
## 5: 4 0.77182236 0.22817764 653
ggplot(data=moon,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point(size=3)
k=8
saturn$kmeans=kmeans(saturn[,c("X1","X2"),with=F],centers = k)$cluster
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=kmeans]
## kmeans group1 group2 volume
## 1: 7 0 1 284
## 2: 8 1 0 44
## 3: 2 1 0 49
## 4: 4 1 0 38
## 5: 6 1 0 39
## 6: 3 1 0 44
## 7: 5 1 0 50
## 8: 1 1 0 52
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(kmeans),shape=factor(Target)))+geom_point()
Load the map backend using the Google API
g<-qmap("Spain",maptype="roadmap",zoom=7)
save(list="g",file="map_spain.RData")
load("map_spain.RData")
load("claims_data_training.RData")
claims[,c("garage_volume"):=list(.N),by="garage_id"]
location=unique(claims[garage_volume>10][,c("garage_id","garage_lon","garage_lat"),with=F])
cluster_kmeans=kmeans(location[,c("garage_lon","garage_lat"),with=F],centers = 10)
location$cluster=factor(cluster_kmeans$cluster)
hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]
g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA)
## Warning: Removed 2581 rows containing missing values (geom_point).
Lots of outliers. Is it a problem ? Or does it mean that outliers are isolated bodyshops that can’t be substituted ?
visual example & illustration of the method
http://www.cse.buffalo.edu/~jing/cse601/fa12/materials/clustering_density.pdf
Define the epsilon radius & the minimum number of points within the radius.
If a point doesn't have enough points in its neighborhood, it is considered an outlier.
The limit of dbscan appears when the local density is heterogeneous (slides 15/16).
The method is consistent because it is based on the density of points: it fits the visual interpretation of "bags" of points.
It can be tuned smartly with the k-nearest-neighbor distance, where k is the "minPts" parameter.
This is done in the subsection related to PCA & outliers removal on the claims database (PCA+DBSCAN)
For a dbscan, only one epsilon (radius) is used, but the clustering can be iterated:
1) define a list of epsilons for which you want to create clusters (the k-nn distance distribution & its quantiles might be insightful);
2) start with the smallest epsilon;
3) cluster 0 stands for "not clustered" (outliers): run the next dbscan (next epsilon) on these points.
Step 3 is iterated with each epsilon, from the smallest to the highest.
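The iteration above can be sketched as a wrapper around any dbscan()-like function that returns a $cluster vector (e.g. dbscan::dbscan); cluster labels from later passes are offset so they do not collide with earlier ones. This is a sketch of the recipe, not a tested production routine:

```r
# Iterated DBSCAN: each pass reclusters only the points the previous pass
# left in cluster 0 (outliers), with the next, larger epsilon.
iterative_dbscan <- function(xy, eps_list, minPts, clusterer) {
  labels <- rep(0L, nrow(xy))
  offset <- 0L
  for (eps in sort(eps_list)) {            # smallest epsilon first
    todo <- which(labels == 0L)
    if (length(todo) < minPts) break       # nothing left worth clustering
    cl <- clusterer(xy[todo, , drop = FALSE], eps, minPts)$cluster
    found <- cl > 0L
    labels[todo[found]] <- cl[found] + offset
    offset <- max(labels)
  }
  labels                                   # 0 = outlier after all passes
}

# e.g. with the dbscan package:
# labels <- iterative_dbscan(as.matrix(db_pca), c(0.5, 1, 2), 50,
#   function(x, eps, mp) dbscan::dbscan(x, eps = eps, minPts = mp))
```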
load("toy_datasets.RData")
linear=rbind(linear_test,linear_train)
saturn=rbind(saturn_test,saturn_train)
moon=rbind(moon_test,moon_train)
linear$dbscan=dbscan(linear[,c("X1","X2"),with=F],eps = 0.05,minPts = 10)$cluster
ggplot(data=linear,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()
# try with eps 0.05, 0.15 (good), 0.3.
# Notice that the outliers are quite consistent, they are the "misclassified"
moon$dbscan=dbscan(moon[,c("X1","X2"),with=F],eps = 0.15,minPts = 50)$cluster
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=dbscan]
## dbscan group1 group2 volume
## 1: 2 0.01152482 0.98847518 1128
## 2: 1 0.97952756 0.02047244 1270
## 3: 0 0.40863787 0.59136213 602
ggplot(data=moon,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()
# try with eps 1, 2 (good, no outliers), 5.
saturn$dbscan=dbscan(saturn[,c("X1","X2"),with=F],eps = 2,minPts = 10)$cluster
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=dbscan]
## dbscan group1 group2 volume
## 1: 1 0 1 284
## 2: 2 1 0 314
## 3: 0 1 0 2
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(dbscan),shape=factor(Target)))+geom_point()
In the end we had to accept a duplication of clusters to get a good classification, when our eyes tell us there are only 2 clusters.
load("bodyshop_geoclust.RData")
load("map_spain.RData")
claims[,c("garage_volume"):=list(.N),by="garage_id"]
location=unique(claims[garage_volume>10][,c("garage_id","garage_lon","garage_lat"),with=F])
cluster_dbscan=dbscan(location[,c("garage_lon","garage_lat"),with=F],eps = 0.3,minPts = 20,borderPoints = TRUE,search = "kdtree",splitRule = "SUGGEST")
location$cluster=factor(cluster_dbscan$cluster)
hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]
g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA)
## Warning: Removed 2445 rows containing missing values (geom_point).
A consistent example of (Gaussian) Mixture clustering on data: http://www.autonlab.org/tutorials/gmm14.pdf (check slides 21-24)
x1<-rnorm(n = 100,mean = 0,sd = .4)
y1<-rnorm(n = 100,mean = 0,sd = .4)
x2<-rnorm(n = 50,mean = 1,sd = .8)
y2<-rnorm(n = 50,mean = 2,sd = .8)
x3<-rnorm(n = 200,mean = 3,sd = .4)
y3<-rnorm(n = 200,mean = 0,sd = .8)
db=data.table(x=c(x1,x2,x3),y=c(y1,y2,y3),belong=c(rep(1,100),rep(2,50),rep(3,200)))
ggplot(data=db,aes(x=x,y=y,color=factor(belong)))+geom_point()
clust=Mclust(db,G=3)
cluster=apply(clust$z,1,which.max)
db$cluster=factor(cluster)
ggplot(data=db,aes(x=x,y=y,color=factor(cluster),shape=factor(belong)))+geom_point(size=3)
# classification quality
db[,list(mean(cluster==1),mean(cluster==2),mean(cluster==3)),by=belong]
## belong V1 V2 V3
## 1: 1 0.97 0.03 0.00
## 2: 2 0.00 0.98 0.02
## 3: 3 0.00 0.02 0.98
load("map_spain.RData")
cluster_Mclust=Mclust(location[,c("garage_lon","garage_lat"),with=F],control = emControl(tol=1.e-6),warn = TRUE,G=1:20)
summary(cluster_Mclust)
save(list="cluster_Mclust",file="Mclust.RData")
Assign to the main cluster & visualize
load("Mclust.RData")
cluster=cluster_Mclust$z
cluster[cluster<0.1]=0
cluster=apply(cluster,1,which.max)
location$cluster=factor(cluster)
## Warning in `[<-.data.table`(x, j = name, value = value): Supplied 4329 items to be assigned to 4029 items of
## column 'cluster' (300 unused)
hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]
load("map_spain.RData")
g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA)
## Warning: Removed 2581 rows containing missing values (geom_point).
When the points overlap, the classification is really uncertain (we lack information to properly split the data):
x1<-rnorm(n = 100,mean = 0,sd = .5)
y1<-rnorm(n = 100,mean = 0,sd = .5)
z1<-rnorm(n = 100,mean = 0,sd = .5)
x2<-rnorm(n = 50,mean = 1,sd = 1)
y2<-rnorm(n = 50,mean = 2,sd = 1)
z2<-rnorm(n = 50,mean = 10,sd = 1)
x3<-rnorm(n = 200,mean = 3,sd = .5)
y3<-rnorm(n = 200,mean = 0,sd = 1)
z3<-rnorm(n = 200,mean = -10,sd = 1)
example=data.table(x=c(x1,x2,x3),y=c(y1,y2,y3),z=c(z1,z2,z3),belong=c(rep(1,100),rep(2,50),rep(3,200)))
ggplot(example,aes(x=x,y=y,color=factor(belong)))+geom_point()
plot3d(example[,c("x","y","z"),with=F],col = factor(example$belong))
Imagine we only observe x & y and want to produce the most consistent classification: fuzzy clustering looks better than the previous methods for this application.
If you think of the claims clustering problem or the bodyshop clustering problem, the probabilities can be used to compute more sophisticated discrepancies amongst observations, based on the fuzzy clustering only.
It can be seen as a tool for dimension reduction, like PCA.
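The probability-as-feature idea can be sketched in base R: given component means, standard deviations and weights (hard-coded here for illustration; in practice they come from Mclust's fitted $parameters), the responsibility matrix z follows from Bayes' rule, and its columns can replace the raw coordinates much like PCA scores.

```r
# Soft assignments (responsibilities) of a 1-D Gaussian mixture (sketch).
responsibilities <- function(x, means, sds, weights) {
  dens <- sapply(seq_along(means), function(k)
    weights[k] * dnorm(x, means[k], sds[k]))  # weighted component density
  dens / rowSums(dens)                        # Bayes rule: normalize rows
}

z <- responsibilities(c(-0.1, 5.2),
                      means = c(0, 5), sds = c(1, 1), weights = c(0.5, 0.5))
# each row sums to 1: a soft cluster membership usable as a feature
```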
linear$Mclust=factor(apply(Mclust(linear[,c("X1","X2"),with=F])$z,1,which.max))
ggplot(data=linear,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()
moon$Mclust=factor(apply(Mclust(moon[,c("X1","X2"),with=F])$z,1,which.max))
moon[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=Mclust]
## Mclust group1 group2 volume
## 1: 4 0.004056795 0.995943205 493
## 2: 2 0.942460317 0.057539683 504
## 3: 3 0.033402923 0.966597077 479
## 4: 1 0.024096386 0.975903614 498
## 5: 6 0.954716981 0.045283019 530
## 6: 5 0.991935484 0.008064516 496
ggplot(data=moon,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()
saturn$Mclust=factor(apply(Mclust(saturn[,c("X1","X2"),with=F],G=1:20)$z,1,which.max))
saturn[,list(group1=mean(Target==1),group2=mean(Target==0),volume=.N),by=Mclust]
## Mclust group1 group2 volume
## 1: 1 0.000000000 1.0000000 26
## 2: 2 1.000000000 0.0000000 34
## 3: 3 1.000000000 0.0000000 48
## 4: 8 0.006711409 0.9932886 149
## 5: 5 1.000000000 0.0000000 39
## 6: 6 1.000000000 0.0000000 45
## 7: 7 0.000000000 1.0000000 55
## 8: 9 1.000000000 0.0000000 44
## 9: 10 1.000000000 0.0000000 55
## 10: 11 1.000000000 0.0000000 49
## 11: 4 0.017857143 0.9821429 56
ggplot(data=saturn,aes(x=X1,y=X2,color=factor(Mclust),shape=factor(Target)))+geom_point()
brands=full_expert_order_all[,list(cost_avg=mean(vehicle_sale_price,na.rm=T),cost_q.8=quantile(vehicle_sale_price,.8,na.rm=T),age_avg=mean(vehicle_age,na.rm=T),age_q.8=quantile(vehicle_age,.8,na.rm=T),volume=.N),by=vehicle_brand]
process=preProcess(brands,method=c("center","scale","BoxCox","pca"))
brands_processed=predict(process,brands)
load("claims_data_training.RData")
db=claims[,c("cost_total","cost_discount","cost_excess","cost_number_sparep","cost_pt_mat","cost_pt_lab","hcost_rep_lab","vehicle_sale_price","distance_customer_offered_garage","distance_customer_garage","garage_tac"),with=F]
res.pca <- PCA(db, quanti.sup = 1, quali.sup = 11,scale.unit=TRUE)
plot(res.pca, choix="var")
dimdesc(res.pca,axes=1:2)
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## cost_pt_lab 0.84244005 0.000000e+00
## cost_pt_mat 0.84050798 0.000000e+00
## cost_total 0.67955013 0.000000e+00
## cost_number_sparep 0.58722469 0.000000e+00
## cost_discount 0.55041417 0.000000e+00
## cost_excess 0.53824564 0.000000e+00
## hcost_rep_lab 0.28630303 0.000000e+00
## vehicle_sale_price 0.13612663 0.000000e+00
## distance_customer_garage 0.04886999 2.149760e-136
## garage_tac 0.03926150 1.090801e-88
## distance_customer_offered_garage 0.01998288 3.018101e-24
##
##
## $Dim.2
## $Dim.2$quanti
## correlation p.value
## distance_customer_garage 0.922454564 0.000000e+00
## distance_customer_offered_garage 0.921626476 0.000000e+00
## vehicle_sale_price 0.061537427 3.243281e-215
## hcost_rep_lab 0.026788436 3.042502e-42
## cost_number_sparep 0.026192219 1.847748e-40
## cost_total 0.021033732 1.092081e-26
## cost_excess -0.007740881 8.317662e-05
## cost_discount -0.012682340 1.139361e-10
## cost_pt_lab -0.049480535 8.900211e-140
## cost_pt_mat -0.050080300 3.847503e-143
## garage_tac -0.069329705 9.793773e-273
## Selection of some individuals
plot(res.pca,select="contrib 100") # plot the 100 individuals with the highest contributions
db_pca=data.table(res.pca$ind$coord[,1:2])
setnames(db_pca,names(db_pca),c("PC1","PC2"))
ggplot(data=sample(db_pca)[1:10000],aes(x=PC1,y=PC2))+geom_point()
# Calibration of the parameters of DBSCAN using k_dist plot
KNN=get.knn(data = sample(db_pca)[1:10000],k=100)$nn.dist[,c(1:10*10)]
k_dist=c(apply(KNN,2,function(x)quantile(x,1:90/100)))
k_dist=data.table(x=rep(1:90,10),k_dist,k=sort(rep(1:10*10,90)))
g<-ggplot(data=k_dist,aes(x=x,y=k_dist,color=factor(k)))+geom_line()
g
# http://stackoverflow.com/questions/12893492/choosing-eps-and-minpts-for-dbscan-r
# tuning the parameters of DBSCAN to find the cluster structure we prefer
# grid=expand.grid(eps=1:10*3/100,minPts=c(50,100,500),method=c("kdtree","linear"),split=c("STD","MIDPT","FAIR","SL_MIDPT","SL_FAIR","SUGGEST"))
grid=expand.grid(eps=1:30/100,minPts=c(50),method=c("kdtree"),split=c("SUGGEST"))
extract=sample(db_pca)[1:25000]
tune=pbapply(grid,1,function(x){
table(dbscan(x = extract,eps = x[1],minPts = x[2],search=x[3],splitRule = x[4])$cluster)
})
tune
## [[1]]
##
## 0
## 25000
##
## [[2]]
##
## 0
## 25000
##
## [[3]]
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13
## 22149 1618 140 57 50 247 77 280 68 107 64 50 50 43
##
## [[4]]
##
## 0 1 2 3 4 5 6 7 8 9
## 16133 7773 158 130 83 78 38 37 103 467
##
## [[5]]
##
## 0 1 2 3 4 5
## 12927 11769 106 50 107 41
##
## [[6]]
##
## 0 1
## 10468 14532
##
## [[7]]
##
## 0 1 2 3 4 5 6
## 9003 15715 60 90 20 54 58
##
## [[8]]
##
## 0 1 2 3
## 7584 17315 50 51
##
## [[9]]
##
## 0 1 2 3
## 6387 18470 75 68
##
## [[10]]
##
## 0 1 2
## 5527 19422 51
##
## [[11]]
##
## 0 1 2
## 4995 19967 38
##
## [[12]]
##
## 0 1
## 4525 20475
##
## [[13]]
##
## 0 1
## 4112 20888
##
## [[14]]
##
## 0 1
## 3854 21146
##
## [[15]]
##
## 0 1
## 3556 21444
##
## [[16]]
##
## 0 1 2
## 3267 21695 38
##
## [[17]]
##
## 0 1
## 3109 21891
##
## [[18]]
##
## 0 1
## 2898 22102
##
## [[19]]
##
## 0 1 2
## 2649 22301 50
##
## [[20]]
##
## 0 1
## 2506 22494
##
## [[21]]
##
## 0 1
## 2385 22615
##
## [[22]]
##
## 0 1
## 2285 22715
##
## [[23]]
##
## 0 1
## 2187 22813
##
## [[24]]
##
## 0 1 2
## 2050 22889 61
##
## [[25]]
##
## 0 1
## 1964 23036
##
## [[26]]
##
## 0 1
## 1863 23137
##
## [[27]]
##
## 0 1
## 1784 23216
##
## [[28]]
##
## 0 1
## 1672 23328
##
## [[29]]
##
## 0 1 2
## 1535 23415 50
##
## [[30]]
##
## 0 1 2
## 1460 23486 54
# first iteration
outliers=dbscan(extract,eps=0.5,minPts = 500)
table(outliers$cluster)
##
## 0 1
## 2762 22238
extract$cluster=outliers$cluster
ggplot(data=extract,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()
# second iteration
extract_2=extract[cluster==0]
outliers=dbscan(extract_2,eps=1,minPts = 100)
table(outliers$cluster)
##
## 0 1
## 380 2382
extract_2$cluster=outliers$cluster
ggplot(data=extract_2,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()
# third iteration
extract_3=extract_2[cluster==0]
outliers=dbscan(extract_3,eps=2,minPts = 50)
table(outliers$cluster)
##
## 0 1 2
## 57 246 77
extract_3$cluster=outliers$cluster
ggplot(data=extract_3,aes(x=PC1,y=PC2,color=factor(cluster)))+geom_point()
# now remove these 0.2% outliers.
# You can check derived & improved methods: OPTICS
A LITTLE THEORY :
Compute a distance between observations: http://www.analytictech.com/networks/hiclus.htm
Given a set of N items to be clustered, and an NxN distance (or similarity) matrix, the basic process of Johnson’s (1967) hierarchical clustering is this:
The question is: when assigning singletons to clusters, what metric should be used?
- minimum distance to points of the cluster
- median distance to points of the cluster
- distance to the medoid of the cluster
- maximum distance to points of the cluster
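These options map, roughly, onto the method argument of base R's hclust(): "single" (minimum), "average"/"median", "centroid" (closest in spirit to the medoid idea), and "complete" (maximum). A quick sketch comparing linkages on the same distance matrix:

```r
# Same distance matrix, different linkage rules (base-R hclust + cutree).
set.seed(7)
xy <- rbind(matrix(rnorm(40, mean = 0, sd = .3), ncol = 2),   # blob 1
            matrix(rnorm(40, mean = 3, sd = .3), ncol = 2))   # blob 2
d <- dist(xy)
cut_k2 <- function(method) cutree(hclust(d, method = method), k = 2)
linkages <- sapply(c("single", "average", "complete"), cut_k2)
# with two well-separated blobs, all three linkages agree at k = 2
```

On noisier data the choice matters: single linkage tends to chain clusters together, while complete linkage produces compact ones.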
APPLICATIONS
See the plots & cut the tree (hierarchy), then visualize with PCA
Application to geoclustering
load("map_spain.RData")
cluster_hierchical=hclust(dist(location[,c("garage_lon","garage_lat"),with=F]))
plot(as.phylo(cluster_hierchical),type="fan",cex=0.5)
memb<-cutree(cluster_hierchical,k=25)
location$cluster=factor(memb)
hulls <- location[, .SD[chull(garage_lon, garage_lat)], by = cluster]
hulls <- hulls[cluster != "0",]
g+geom_point(data=location[!cluster=="0"],aes(x=garage_lon,y=garage_lat,color=cluster))+geom_polygon(data = hulls, aes(x=garage_lon,y=garage_lat,fill = cluster, group = cluster),alpha = 0.5, color = NA)
## Warning: Removed 2581 rows containing missing values (geom_point).
Application to brand classification
cluster_hierchical=hclust(dist(brands_processed[,c("PC1","PC2","PC3"),with=F]))
plot(cluster_hierchical,labels =brands_processed$vehicle_brand)
memb<-cutree(cluster_hierchical,k=5)
brands_processed$hcluster=factor(memb)
brands_processed[,list(paste(vehicle_brand,collapse=",")),by=hcluster]
## hcluster
## 1: 1
## 2: 2
## 3: 3
## 4: 4
## 5: 5
## V1
## 1: PEUGEOT,NISSAN,VOLKSWAGEN,OPEL,RENAULT,BMW,FIAT,FORD,CITROEN,AUDI,MERCEDES,NA,SUZUKI,SEAT,VOLVO,MITSUBISHI,CHRYSLER,ALFA ROMEO,JEEP-CHRYSLER,SAAB,ROVER
## 2: MAZDA,HYUNDAI,HONDA,TOYOTA,MINI,KIA,CHEVROLET-GM,SKODA
## 3: SSANGYONG,LEXUS,JAGUAR,LAND ROVER,PORSCHE
## 4: DACIA
## 5: DAEWOO
Visualization using PCA
hulls <- brands_processed[, .SD[chull(PC1,PC2)], by = hcluster]
ggplot()+geom_point(data=brands_processed,aes(x=PC1,y=PC2,color=hcluster))+geom_polygon(data = hulls, aes(x=PC1,y=PC2,fill = hcluster, group = hcluster),alpha = 0.5, color = NA)+geom_text(data=brands_processed,aes(x=PC1,y=PC2,label=vehicle_brand,color=hcluster))